Guided Clustering for Social Media Nowcasting

Author

  • Dolan Antenucci
Abstract

The last several years have seen growth in social media “nowcasting” applications: the use of social media data to predict real-world phenomena such as flu activity [5], unemployment behavior [1], and more. Generally, these projects pick a target phenomenon that has some sort of official data (e.g., U.S. weekly unemployment insurance (UI) claims), train a prediction model with this data and features derived from a social media corpus, and finally generate a prediction with that model. Unfortunately, a large class of phenomena lacks the training data required to build an accurate prediction model. For instance, the economists we worked with in 2014 [1] shared the U.S. Census Bureau's interest in indicating human migration patterns via social media, something that is traditionally collected only every ten years. Arguably, most real-world phenomena lack the training data needed to build a supervised prediction model, because traditional survey-driven data collection methods are very expensive and thus cover very few phenomena.

One important challenge in these projects is how to pick the features derived from the social media corpus. For example, if the features are signals representing the weekly frequency of different phrases, “I lost my job” might be a good indicator of unemployment behavior. Past projects have selected these features with either data-intensive processes (i.e., correlation with the target) or labor-intensive processes (i.e., hand-filtering). For phenomena that lack sufficient training data, a data-intensive process is not feasible, since there is no data to test correlations against, and a labor-intensive process remains unattractive for the reasons outlined in previous work [2] (time-consuming, prone to error, etc.). Choosing features for these low-data phenomena is therefore a difficult and important challenge.

One possible solution is based on clustering. The features could be clustered together based on their relatedness to each other. A user would then choose a cluster that is both interesting and high-quality. An interesting cluster is one that matches what the user believes is descriptive of the target phenomenon. A high-quality cluster is one where all the elements are related with each other and which appear...
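The clustering idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how relatedness is measured or which clustering algorithm is used, so everything below is an assumption: relatedness is taken to be the Pearson correlation between weekly phrase-frequency signals, phrases are grouped by simple single-linkage thresholding, and the phrase counts, the `cluster_signals` name, and the `threshold` value are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    norm = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / norm if norm else 0.0


def cluster_signals(signals, threshold=0.8):
    """Group phrases whose weekly signals correlate at or above `threshold`.

    Hypothetical helper: single-linkage grouping via union-find, so two
    phrases share a cluster if a chain of correlated pairs connects them.
    """
    phrases = list(signals)
    parent = {p: p for p in phrases}

    def find(p):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            if pearson(signals[a], signals[b]) >= threshold:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for p in phrases:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())


# Made-up weekly phrase counts, for illustration only.
weekly_counts = {
    "i lost my job": [10, 12, 30, 28, 15, 11],
    "got laid off": [5, 6, 16, 14, 8, 6],
    "filed for unemployment": [2, 3, 9, 8, 4, 2],
    "have the flu": [20, 18, 9, 5, 4, 3],
    "fever and chills": [11, 10, 5, 3, 2, 2],
}

clusters = cluster_signals(weekly_counts)
for members in clusters:
    print(members)
```

On this toy data the job-related phrases and the flu-related phrases fall into separate clusters. A user would then inspect the clusters and pick the one that best matches the target phenomenon; note that no official target data is needed at any point.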

Similar resources

Ringtail: A Generalized Nowcasting System

Social media nowcasting, the use of online user activity to describe real-world phenomena, is an active area of research to supplement more traditional and costly data collection methods such as phone surveys. Given the potential impact of such research, we would expect general-purpose nowcasting systems to quickly become a standard tool among non-computer scientists, yet it has largely remained a rese...

A Declarative Query Processing System for Nowcasting

Nowcasting is the practice of using social media data to quantify ongoing real-world phenomena. It has been used by researchers to measure flu activity, unemployment behavior, and more. However, the typical nowcasting workflow requires either slow and tedious manual searching of relevant social media messages or automated statistical approaches that are prone to spurious and low-quality results...

Ringtail: Feature Selection For Easier Nowcasting

In recent years, social media “nowcasting”—the use of online user activity to predict various ongoing real-world social phenomena—has become a popular research topic; yet, this popularity has not led to widespread actual practice. We believe a major obstacle to widespread adoption is the feature selection problem. Typical nowcasting systems require the user to choose a set of relevant social me...

Forecasting Word Model: Twitter-based Influenza Surveillance and Prediction

Because of the increasing popularity of social media, much information has been shared on the internet, enabling social media users to understand various real world events. Particularly, social media-based infectious disease surveillance has attracted increasing attention. In this work, we specifically examine influenza: a common topic of communication on social media. The fundamental theory of...

DEFENDER: Detecting and Forecasting Epidemics Using Novel Data-Analytics for Enhanced Response

In recent years social and news media have increasingly been used to explain patterns in disease activity and progression. Social media data, principally from the Twitter network, has been shown to correlate well with official disease case counts. This fact has been exploited to provide advance warning of outbreak detection, forecasting of disease levels and the ability to predict the likelihoo...

Publication date: 2015